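Building headers with regex replacement
When writing a crawler and constructing requests, you inevitably have to add request headers (headers). This is usually done by copying fields such as user-agent from the browser's developer tools, but when there are many fields, turning them into a dict by hand gets tedious.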
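Header example
Take the header block below, copied as plain "name: value" lines. To turn it into a dict you would have to add quotes and commas line by line. The two methods that follow do this quickly. (The fields shown here happen to be response headers, but the formatting technique is the same for request headers.)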
age: 21
cache-control: no-cache,no-store,private
cdn-ip: 2408:8752:300:6:0:1:2:11
cdn-source: baishan
cdn-user-ip: 2408:84ef:10:b3d0:c176:8470:c886:c23
content-encoding: gzip
content-type: text/html; charset=GBK
date: Mon, 16 Mar 2020 13:59:17 GMT
expires: Mon, 16 Mar 2020 14:00:16 GMT
server: nginx
status: 200
vary: Accept-Encoding
x-cache-remote: HIT
x-content-from: netease
x-ser: BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC75_lt-hunan-yueyang-2-cache-4
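Regex replacement in Sublime
Open Sublime Text, open the Replace panel, enable regular expression mode, and enter the following in the Find field: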
(.*?): (.*)
Then enter the following in the Replace field:
"$1": "$2",
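Here $1 and $2 refer to the first and second capture groups, that is, the text matched by each parenthesized part of the pattern: the header name before the first ": " and the value after it.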
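The same substitution can be tried outside the editor. The following is a minimal sketch, assuming Python's re module, where group references in the replacement are written \1 and \2 instead of Sublime's $1 and $2. The two sample lines are taken from the header example above, and the names raw and quoted are only illustrative.

```python
import re

# Two sample lines from the header example
raw = "server: nginx\ncontent-type: text/html; charset=GBK"

# Same pattern as in the Find field; \1 and \2 play the role of $1 and $2
quoted = re.sub(r'(.*?): (.*)', r'"\1": "\2",', raw)

print(quoted)
# "server": "nginx",
# "content-type": "text/html; charset=GBK",
```

Because the first group is lazy, it stops at the first colon followed by a space, so colons inside a value (an IPv6 address or a timestamp, for instance) stay in the second group.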
Regex replacement in Python
The same thing can of course be done with a short program:
# Import the regex module
import re

# Raw header text, one "name: value" pair per line
header = '''
age: 21
cache-control: no-cache,no-store,private
cdn-ip: 2408:8752:300:6:0:1:2:11
cdn-source: baishan
cdn-user-ip: 2408:84ef:10:b3d0:c176:8470:c886:c23
content-encoding: gzip
content-type: text/html; charset=GBK
date: Mon, 16 Mar 2020 13:59:17 GMT
expires: Mon, 16 Mar 2020 14:00:16 GMT
server: nginx
status: 200
vary: Accept-Encoding
x-cache-remote: HIT
x-content-from: netease
x-ser: BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC75_lt-hunan-yueyang-2-cache-4
'''

# Capture the name and the value on each line
m = re.findall(r'(.*?): (.*)', header)

# Format each pair as a quoted dict entry and join the entries with newlines
headers = '\n'.join(f'"{name}": "{value}",' for name, value in m)

# Print the result
print(headers)
Output
"age": "21",
"cache-control": "no-cache,no-store,private",
"cdn-ip": "2408:8752:300:6:0:1:2:11",
"cdn-source": "baishan",
"cdn-user-ip": "2408:84ef:10:b3d0:c176:8470:c886:c23",
"content-encoding": "gzip",
"content-type": "text/html; charset=GBK",
"date": "Mon, 16 Mar 2020 13:59:17 GMT",
"expires": "Mon, 16 Mar 2020 14:00:16 GMT",
"server": "nginx",
"status": "200",
"vary": "Accept-Encoding",
"x-cache-remote": "HIT",
"x-content-from": "netease",
"x-ser": "BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC75_lt-hunan-yueyang-2-cache-4",